Cancer is one of the diseases with the highest mortality rate in the world. It is a disease in
which abnormal cells grow out of control and can invade adjacent organs or spread to other organs.
Lung cancer is a condition in which malignant cells form in the lungs. It can be diagnosed with
X-ray images, CT scans, and lung tissue biopsy. In this modern era, technology is expected to
support research in the field of health. Therefore, in this study, features extracted from CT
images were used as data to classify lung cancer. We used CT scan image data from the SPIE-AAPM
Lung CT Challenge 2015. Fuzzy C-Means and fuzzy kernel C-Means were used to classify each
patient's lung nodule as benign or malignant. Fuzzy C-Means is a soft clustering method that uses
Euclidean distance to calculate the cluster centers and membership matrix, whereas fuzzy kernel
C-Means uses kernel distance. For comparison, a support vector machine obtained a 72% average AUC
in another study. Simulations were performed using different k-folds. The results showed that
fuzzy kernel C-Means had the highest accuracy of 74%, while fuzzy C-Means obtained 73% accuracy.

Keywords: Fuzzy kernel C-Means, Image classification, Lung nodule, Machine learning
1. INTRODUCTION

Lung cancer is the uncontrolled formation of malignant cells in the lungs [1]. According to
medical analysis, 1 out of every 20 people diagnosed with this disease lives at least 10 more
years, while 1 in every 3 dies within a year [2]. However, patients' survival rates vary widely,
and early diagnosis makes a huge difference. Diagnosis is carried out using an X-ray image, a CT
scan, and a lung tissue biopsy. From these three tests, the doctor determines the cancer type and
stage [1]. A spot on a lung CT scan is defined as a nodule, which is either benign or malignant
[3]. Radiologists are often mentally burdened and fatigued from examining many images in a day,
which may impair their ability to detect and classify a tumor correctly [4]. Therefore, this study
used computed tomography (CT) or magnetic resonance imaging (MRI) scans to classify patients.
Clustering analysis was used to group the data into clusters [5]. It is an unsupervised learning
method that groups objects so that similar objects share a cluster and dissimilar objects fall
into different clusters [6]. One of the most popular clustering methods is fuzzy C-Means, and
applying a kernel to it yields fuzzy kernel C-Means. In 2015, the SPIE Medical Imaging conference
held a "Grand Challenge" called LUNGx. This event was supported by the American Association of
Physicists in Medicine (AAPM) and the National Cancer Institute (NCI). The challenge was used to
determine the best methods for classifying malignant and benign lung nodules based on quantitative
image data available on their website [7]. This study, therefore, aims to classify lung cancer
based on the LUNGx SPIE-AAPM data using the fuzzy C-Means and fuzzy kernel C-Means clustering
algorithms. Previous research on the classification of lung nodules used various methods such as
convolutional neural networks [8]-[10], the support vector machine [11], and a semi-supervised
adversarial model [12]. The fuzzy C-Means method has previously been applied to thalassemia data
[13], breast cancer [14], and intrusion detection systems [15], while fuzzy kernel C-Means has
been applied to chronic sinusitis [16], insolvency prediction [17], and the direction of
Indonesian stock price movement [18].


2. RESEARCH METHOD 

In this research, CT scans of 70 patients, each consisting of more than 200 CT images, were
classified. Each patient had at least 1 lung nodule, giving a total of 83 lung nodules, which were
cropped, converted into numerical data, and classified. The classification steps are as follows.


2.1. Image preprocessing 

First, each image is cropped from 512x512 to 64x64 pixels using Python 3.7, following the lung
nodule coordinates (x, y, instance number) of each patient. Running the program automatically
converts the image data of the 70 patients into 83 grayscale lung nodule images in TIFF format
(.tiff). Structures outside the lung nodule were partially removed by a Python script for manual
thresholding. The remaining non-nodule parts were then removed manually in the GIMP application by
setting them to black (pixel value 0). The aim is to isolate the lung nodule without changing the
pixel size of the image (64x64). The preprocessing steps are shown in Figure 1.


Figure 1. Preprocessing step to crop lung nodule from the lung CT image 
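
The cropping and thresholding steps can be sketched as follows. This is a minimal illustration
under stated assumptions, not the script used in the study: the function names `crop_nodule` and
`threshold_background`, the choice to centre the 64x64 window on (x, y), and the threshold value
are all hypothetical, and the slice is assumed to be a 512x512 NumPy array already selected by its
instance number.

```python
import numpy as np

def crop_nodule(slice_2d, x, y, size=64):
    """Crop a size x size window around (x, y), shifted if needed to stay inside the image."""
    slice_2d = np.asarray(slice_2d)
    height, width = slice_2d.shape
    half = size // 2
    # Clip the top-left corner so the window never leaves the image.
    top = int(np.clip(y - half, 0, height - size))
    left = int(np.clip(x - half, 0, width - size))
    return slice_2d[top:top + size, left:left + size]

def threshold_background(patch, low):
    """Set every pixel below the (manually chosen) threshold to black (0)."""
    out = np.array(patch, copy=True)
    out[out < low] = 0
    return out

# Example with a synthetic 512x512 slice (assumed data, for illustration only).
ct_slice = np.arange(512 * 512).reshape(512, 512)
patch = crop_nodule(ct_slice, x=256, y=256)
clean = threshold_background(patch, low=100)
```

The crop keeps the output at 64x64 pixels even for nodules near the image border, which matches
the requirement that the pixel size of the cropped image does not change.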


2.2. Feature extraction 

After preprocessing, each image is converted to numeric data through feature extraction. The
extracted values are stored in a data frame with one row per lung nodule and one column per
feature. The following features are used:
a. Nodule size

This is the area of the nodule, measured as the number of pixels.
b. GLCM 

The gray level co-occurrence matrix (GLCM), introduced by the computer scientist Robert Haralick,
quantifies the heterogeneity of surface patterns and the roughness displayed in digital images
[19]. Each of its indices highlights a certain property of texture, such as bumpiness,
irregularity, or smoothness [20]. Texture is a term commonly used to characterize the gray-level
variations of an image [21]. The GLCM features used in this study are contrast, homogeneity,
angular second moment (ASM), and energy [19]. Their formulas are as follows:


$$\text{Contrast} = \sum_{i,j=0}^{N-1} P_{i,j}\,(i-j)^2 \tag{1}$$

$$\text{Homogeneity} = \sum_{i,j=0}^{N-1} \frac{P_{i,j}}{1+(i-j)^2} \tag{2}$$

$$\text{ASM} = \sum_{i,j=0}^{N-1} P_{i,j}^{2} \tag{3}$$

$$\text{Energy} = \sqrt{\text{ASM}} \tag{4}$$

$P_{i,j}$ = value in row $i$ and column $j$ of the GLCM
$N$ = total number of rows or columns of the GLCM
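
As an illustration of (1)-(4), the sketch below builds a normalized GLCM with plain NumPy and
evaluates the four features. Several choices here are assumptions that the text does not specify:
the pixel-pair offset (one pixel to the right), the number of gray levels, and the helper names
`glcm_horizontal` and `glcm_features`.

```python
import numpy as np

def glcm_horizontal(img, levels):
    """Normalized co-occurrence matrix for pixel pairs (p, right neighbour of p)."""
    img = np.asarray(img, dtype=int)
    P = np.zeros((levels, levels), dtype=float)
    left = img[:, :-1].ravel()
    right = img[:, 1:].ravel()
    np.add.at(P, (left, right), 1.0)   # count each (left, right) gray-level pair
    return P / P.sum()

def glcm_features(P):
    """Contrast, homogeneity, ASM and energy as in equations (1)-(4)."""
    i, j = np.indices(P.shape)
    asm = float(np.sum(P ** 2))
    return {
        "contrast": float(np.sum(P * (i - j) ** 2)),
        "homogeneity": float(np.sum(P / (1.0 + (i - j) ** 2))),
        "asm": asm,
        "energy": float(np.sqrt(asm)),
    }

# Tiny two-level example: a 2x2 checkerboard.
features = glcm_features(glcm_horizontal([[0, 1], [1, 0]], levels=2))
```

A perfectly uniform patch gives zero contrast and homogeneity, ASM, and energy all equal to 1,
while the checkerboard above gives a contrast of 1, which is the behaviour the formulas predict.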


c. LBP 

The local binary pattern (LBP) is usually computed over a 3x3 pixel neighborhood. It is an
efficient and effective image processing method that reveals locally repeated patterns [22]. It
re-encodes the central value of each 3x3 pixel neighborhood [22]. The LBP feature used in this
study is LBP energy, given by (5):


$$\text{LBP Energy} = \sum_{i,j=0}^{N-1} P_{i,j}^{2} \tag{5}$$

$P_{i,j}$ = value in row $i$ and column $j$ of the LBP histogram
$N$ = total number of bins of the LBP histogram
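
A basic 3x3 LBP and its histogram energy can be sketched as below. This is only one plausible
reading of the feature: the neighbour ordering, the use of a 256-bin normalized histogram, and the
definition of energy as the sum of squared bin values are assumptions, and the names `lbp_3x3` and
`lbp_energy` are hypothetical.

```python
import numpy as np

def lbp_3x3(img):
    """Encode each interior pixel by comparing its 8 neighbours with the centre value."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.uint8)
    # Clockwise neighbour offsets (dy, dx), starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes += (neighbour >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes

def lbp_energy(img):
    """Sum of squared bins of the normalized 256-bin LBP histogram (assumed definition)."""
    codes = lbp_3x3(img)
    hist = np.bincount(codes.ravel(), minlength=256) / codes.size
    return float(np.sum(hist ** 2))

# A constant patch has a single LBP code, so all histogram mass sits in one bin.
uniform_energy = lbp_energy(np.ones((5, 5)))
```

Under this definition the energy is largest when one pattern dominates the patch and decreases
as the local patterns become more varied.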


Furthermore, the data frame is standardized using the scikit-learn Python module so that each
feature has a mean of 0 and a standard deviation of 1. Values that previously ranged from tens to
thousands are thereby rescaled to comparable magnitudes, so that no single feature dominates the
distance calculations in the algorithm. Table 1 shows the first rows of the data frame after
standardization.

In Table 1, the number of pixels column denotes the nodule size. GLCM contrast, GLCM homogeneity,
GLCM ASM, and GLCM energy are the features produced by the GLCM method. LBP energy is the feature
produced by the LBP method.


Table 1. The data frame after standardization

No. Patient   Number of Pixels   GLCM Contrast   GLCM Homogeneity   GLCM ASM   GLCM Energy   LBP Energy
0             -1.17              -0.99            1.20               1.34       1.21          1.39
1             -1.14              -0.91            1.18               1.31       1.18          1.35
2             -0.37              -0.63            0.27               0.27       0.30          0.30
3              0.16               0.51           -0.11              -0.19      -0.12         -0.24
4              0.58              -0.40           -0.55              -0.68      -0.59         -0.74
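
The standardization behind Table 1 can be illustrated with a short NumPy sketch. It assumes a
column-wise z-score using the population standard deviation, which is what scikit-learn's
`StandardScaler` computes by default; the function name `standardize` and the sample values are
hypothetical.

```python
import numpy as np

def standardize(features):
    """Rescale every column to mean 0 and standard deviation 1 (z-score)."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)       # population standard deviation (ddof=0)
    std[std == 0] = 1.0              # leave constant columns unscaled
    return (features - mean) / std

# Made-up raw values: nodule size in pixels and a feature on a much smaller scale.
raw = [[120.0, 0.02], [950.0, 0.10], [3100.0, 0.06]]
z = standardize(raw)
```

After this step both columns have the same scale, so a distance between two rows is no longer
dominated by the feature with the largest raw values.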


2.3. Fuzzy c-means and fuzzy kernel c-means 

Once preprocessing of the CT images is complete, a data frame with the features used for
classification is available. An unsupervised learning method, fuzzy C-Means, is then used to
cluster each patient's nodule as benign or malignant [23]. The accuracy of fuzzy C-Means depends
on the type of data. When the data are not linearly separable, its convergence is slow and its
results are inaccurate; a kernel is used to correct this. A kernel transforms the data set into a
new, higher-dimensional feature space [24]. In this way, a non-linear problem can be handled with
linear models [13]. Let $x \in \mathbb{R}^d$ be a point of the original data set. To transform
the data set in $\mathbb{R}^d$ into a new feature space $F$, a function $\varphi$ is used [25]:


$$\varphi : \mathbb{R}^d \to F \tag{6}$$

The kernel function is defined as [26]:

$$K(x,y) = \langle \varphi(x), \varphi(y) \rangle \tag{7}$$

The kernel distance is [17]:

$$
\begin{aligned}
d^2(x,y) &= \|\varphi(x) - \varphi(y)\|^2 \\
         &= \varphi(x)^T\varphi(x) - 2\,\varphi(x)^T\varphi(y) + \varphi(y)^T\varphi(y) \\
         &= K(x,x) - 2K(x,y) + K(y,y)
\end{aligned}
\tag{8}
$$

In this study, we use the RBF kernel [13]:

$$K(x,y) = \exp\left(-\frac{\|x-y\|^2}{2\sigma^2}\right) \tag{9}$$

where

$$K(x,x) = K(y,y) = 1 \tag{10}$$

Hence,

$$d^2(x,y) = 2\,(1 - K(x,y)) \tag{11}$$
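
The step from (8) to (11) can be checked numerically. The sketch below is a minimal illustration
in which the function names `rbf_kernel` and `kernel_distance_sq` and the sample points are
assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel of equation (9)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def kernel_distance_sq(x, y, sigma=1.0):
    """Squared kernel distance of equation (11)."""
    return 2.0 * (1.0 - rbf_kernel(x, y, sigma))

x, y = [0.0, 0.0], [1.0, 0.0]
# General form (8): K(x,x) - 2K(x,y) + K(y,y); it reduces to (11) because K(x,x) = K(y,y) = 1.
general = rbf_kernel(x, x) - 2.0 * rbf_kernel(x, y) + rbf_kernel(y, y)
shortcut = kernel_distance_sq(x, y)
```

Because the RBF kernel takes values in (0, 1], the squared kernel distance is bounded between 0
(identical points) and 2 (points far apart relative to $\sigma$).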


In this study, a kernel is applied to the fuzzy C-Means method, and the classification problem is
solved using the resulting fuzzy kernel C-Means. Consider a data set
$X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$, an $n \times c$ membership matrix
$U = [u_{ij}]$, $1 \le i \le n$, $1 \le j \le c$, and cluster centers
$V = \{v_1, v_2, \ldots, v_c\}$, where every element of $V$ belongs to the $d$-dimensional
Euclidean space [24]. The objective function is [13], [14]:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m}\, \|\varphi(x_i) - \varphi(v_j)\|^2 \tag{12}$$

where $m > 1$, $m \in \mathbb{R}$, is the fuzzifier, with constraints:

$$\sum_{j=1}^{c} u_{ij} = 1, \quad i = 1, 2, \ldots, n \tag{13}$$

$$\sum_{i=1}^{n} u_{ij} > 0, \quad j = 1, 2, \ldots, c$$

$$u_{ij} \in [0,1], \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, c \tag{14}$$


The algorithm of fuzzy kernel C-Means is shown in Figure 2:

1) For $t = 1$ to $T$, let $v_j^{(t)}$ be the cluster centers, where $t = 0$ gives the initial
   centers, $j = 1, 2, \ldots, c$.
2) Use the RBF kernel to calculate the distance between $x_i$ and $v_j$:

   $$\|\varphi(x_i) - \varphi(v_j)\|^2 = 2\,(1 - K(x_i, v_j)) = d^2(x_i, v_j)$$

3) Calculate the membership value:

   $$u_{ij} = \left[\sum_{k=1}^{c} \left(\frac{d^2(x_i, v_j)}{d^2(x_i, v_k)}\right)^{\frac{1}{m-1}}\right]^{-1}, \quad m > 1$$

4) Update the cluster centers.
5) If $\|v^{(t)} - v^{(t-1)}\| < \varepsilon$ or $t = T$, stop; otherwise
6) Go back to step 2).

Figure 2. Algorithm of fuzzy kernel C-Means
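
The loop in Figure 2 can be sketched in NumPy as follows. This is an illustrative implementation
under assumptions rather than the code used in the study. In particular, the text does not state
the center update formula, so the sketch uses the update commonly paired with the RBF kernel, in
which each center is the mean of the data weighted by $u_{ij}^m K(x_i, v_j)$. The stopping test on
the shift of the centers, the random initialization, and the `init` parameter are also assumptions.

```python
import numpy as np

def rbf_kernel_matrix(X, V, sigma):
    """K[i, j] = exp(-||x_i - v_j||^2 / (2 sigma^2))."""
    sq_dist = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def memberships(K, m):
    """Step 3: memberships from the kernel distances d^2 = 2 (1 - K)."""
    d2 = np.maximum(2.0 * (1.0 - K), 1e-12)      # avoid division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_kernel_cmeans(X, c=2, m=2.0, sigma=1.0, max_iter=100, tol=1e-6,
                        init=None, seed=0):
    X = np.asarray(X, dtype=float)
    if init is None:
        # Assumed initialization: c distinct data points chosen at random.
        rng = np.random.default_rng(seed)
        V = X[rng.choice(len(X), size=c, replace=False)].copy()
    else:
        V = np.asarray(init, dtype=float).copy()
    for _ in range(max_iter):
        K = rbf_kernel_matrix(X, V, sigma)        # step 2
        U = memberships(K, m)                     # step 3
        W = (U ** m) * K                          # assumed weights for step 4
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.linalg.norm(V_new - V)         # step 5 (assumed criterion)
        V = V_new
        if shift < tol:
            break
    # Recompute memberships so that U matches the final centers.
    U = memberships(rbf_kernel_matrix(X, V, sigma), m)
    return U, V

# Two small synthetic groups of points (made-up data, for illustration only).
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
group_b = group_a + 3.0
U, V = fuzzy_kernel_cmeans(np.vstack([group_a, group_b]), c=2,
                           init=[[0.0, 0.0], [3.0, 3.0]])
labels = U.argmax(axis=1)
```

Each row of `U` sums to 1, as required by constraint (13), and a hard label is obtained by taking
the cluster with the largest membership. Mapping the two clusters to benign and malignant requires
an extra rule that is not described in the text.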


2.4. Model performance validation 

In this study, simulations were performed with different values of k using k-fold
cross-validation. The data are separated into training and test sets by splitting them into folds
of approximately equal size [27], [28]. Performance is measured by accuracy, precision, recall,
specificity, and F1 score. Let TP, TN, FP, and FN denote the numbers of true positives, true
negatives, false positives, and false negatives. The formulas are given in (15)-(19) [29]:


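$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{16}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{17}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{18}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{19}$$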
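
The fold splitting and the metrics (15)-(19) can be sketched as below. The helper names
`kfold_indices` and `classification_metrics`, the shuffling seed, and the example counts are
assumptions made for illustration. The sketch also assumes that no denominator is zero.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k folds of nearly equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity and F1 score as in (15)-(19)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# 83 nodules split into 5 folds; each fold serves once as the test set.
folds = kfold_indices(83, k=5)
# Made-up confusion counts for one test fold.
scores = classification_metrics(tp=6, tn=8, fp=2, fn=4)
```

With 83 samples the folds cannot all have the same size, so `np.array_split` produces folds whose
sizes differ by at most one, in line with the approximately equal split described in the text.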


3. RESULTS AND ANALYSIS 

All simulations in this study were run in Python 3.7.

According to Table 2, fuzzy C-Means with k = 2 recorded its highest accuracy (73.2%), precision
(80%), and F1 score (68.6%). With k = 5, it recorded its highest recall (62.5%) and specificity
(87.5%).


Table 2. Results of lung classification using fuzzy C-Means

K-Fold (k)   Accuracy (%)   Precision (%)   Recall (%)   Specificity (%)   F1 Score (%)
2            73.2           80              60           85.7              68.6
3            66.7           75              54.8         85.7              57.1
4            65             71.4            50           80                58.8
5            62.5           75              62.5         87.5              50
6            69.2           75              50           85.7              60
7            63.6           60              60           66.7              60
8            60             60              60           60                60
9            50             50              50           50                50

According to Table 3, fuzzy kernel C-Means with k = 3 recorded its highest accuracy (74.1%),
recall (69.2%), and F1 score (72%). With k = 2, it recorded its highest precision (80%) and
specificity (85.7%).


Table 3. Results of lung nodule classification using fuzzy kernel C-Means with RBF kernel and σ = 1

K-Fold (k)   Accuracy (%)   Precision (%)   Recall (%)   Specificity (%)   F1 Score (%)
2            73.2           80              60           85.7              68.6
3            74.1           75              69.2         78.6              72
4            65             66.7            60           70                63.2
5            62.5           62.5            62.5         62.5              62.5
6            61.5           57.1            66.7         57.1              61.5
7            54.5           50              60           50                54.5
8            50             50              60           60                54
9            62.5           66.7            50           71.5              57.1


From the results above, we conclude that the best accuracy, recall, and F1 score are achieved by
fuzzy kernel C-Means, with 74.1% accuracy, 69.2% recall, and a 72% F1 score. The best specificity,
87.5%, is achieved by fuzzy C-Means. The best precision, 80%, is achieved by both classifiers.
These results show that fuzzy kernel C-Means performs better than fuzzy C-Means for lung cancer
classification. However, this result cannot be generalized to different data or different
optimization parameters; the data used and the optimization parameters are therefore the
limitations of this study.


4. CONCLUSION 

This research used CT image data from the LUNGx Challenge hosted by SPIE-AAPM-NCI in 2015. After
the CT image data were converted into numeric data using extracted features, namely lung nodule
size, the gray level co-occurrence matrix (GLCM), and the local binary pattern (LBP), the lung
nodules could be classified as benign or malignant. The data set was separated into training and
test sets using k-fold cross-validation, and fuzzy C-Means and fuzzy kernel C-Means were used for
classification. Model performance was evaluated by accuracy, precision, recall, specificity, and
F1 score. Across the simulations run in Python 3.7, the best accuracy, recall, and F1 score were
achieved by fuzzy kernel C-Means, with 74.1% accuracy, 69.2% recall, and a 72% F1 score. However,
the best specificity of 87.5% was achieved by fuzzy C-Means, and the best precision of 80% was
achieved equally by both classifiers. These results show that using a kernel in the fuzzy C-Means
method can improve its performance. Therefore, fuzzy kernel C-Means performs better than fuzzy
C-Means, with the data used and the optimization parameters as limitations.